Fixing porter can strand parcels until daemon restart#2132
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a bug where sweep parcels could become stranded in the chain porter if a failure occurred after the initial response was sent to the caller. It introduces a robust retry mechanism for specific post-delivery states and optimizes how sweeper-originated transactions interact with the porter to prevent unnecessary broadcasts. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
214b5d6 to
b7633c0
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a retry mechanism for the chain porter to handle recoverable post-delivery failures, such as those occurring during transaction confirmation or proof transfer. It also refactors the state machine to ensure local addresses are imported even when anchor transaction broadcasts are skipped. Review feedback highlighted a potential deadlock in the advanceState goroutine when handling background parcels with unbuffered error channels. Other improvements suggested include clearing retry bookkeeping upon permanent failure to prevent memory leaks and implementing a maximum retry limit to avoid infinite loops.
97bc49d to
e1b2466
Compare
e1b2466 to
76f2208
Compare
kaldun-tech
left a comment
There was a problem hiding this comment.
I think the change here is solid and addresses the failure case. Some minor suggestions related to improving documentation and adding a test case for shutdown
| daveSweepTxHash := daveSweepBlocks[0].Transactions[1].TxHash() | ||
| daveSweepTxHash := resolveMinedTransferTxid( | ||
| t.t, dave, daveSweepBlocks[0], | ||
| ) |
There was a problem hiding this comment.
We could use more context on the design decisions here. The old code failed with RBF (replace-by-fee) because it assumed the coinbase transaction at index 0, and sweep transaction is at index 1 in the block. And there's only one non-coinbase transaction in the block. The RBF sweep transactions invalidated this, or if multiple transactions are mined in the same block.
The new approach with resolveMinedTransferTxid finds which transaction in the block tapd know is a transfer. So decouples the tests from block transaction ordering, RBF replacements, multiple transactions in a block and tapd timing issues.
|
|
||
| closeStream, _, err := net.CloseChannel(local, chanPoint, false) | ||
| require.NoError(t.t, err) | ||
|
|
There was a problem hiding this comment.
Moving the close after the SubscribeSendEvents fixes a race condition:
CloseChannelstarts the close process- tapd processes close, fires
SendStateCompleteevent SubscribeSendEventscalled — but event already firedwaitForSendEventblocks forever waiting for missed event
By subscribing first, all events from the close operation are captured. This implements the standard "subscribe before action" pattern
So it's good we could use more comments IMO
| pollInterval = 200 * time.Millisecond | ||
| ) | ||
|
|
||
| require.Eventually(t, func() bool { |
There was a problem hiding this comment.
This will get us more detailed failure messages about the state and make the testing more robust
| blockHashSet bool | ||
| blockHeight uint32 | ||
| blockHeightHint uint32 | ||
| pollInterval = 200 * time.Millisecond |
There was a problem hiding this comment.
Minor: 200 millis gets used in a few different spots, consider intorducing a constant
| case SendStateStorePostAnchorTxConf: | ||
| fallthrough | ||
| case SendStateTransferProofs: | ||
| return true |
There was a problem hiding this comment.
Minor: The fallthrough style is unusual. More idiomatic:
switch state {
case SendStateWaitTxConf, SendStateStorePostAnchorTxConf, SendStateTransferProofs:
return true
default:
return false
}
Or even simpler:
return state == SendStateWaitTxConf ||
state == SendStateStorePostAnchorTxConf ||
state == SendStateTransferProofs
| if pkg == nil || pkg.OutboundPkg == nil || | ||
| pkg.OutboundPkg.AnchorTx == nil { | ||
|
|
||
| return |
There was a problem hiding this comment.
Invariant: we can only add to postDeliveryRetryAttempts if AnchorTx is non-nil, and only need to delete it in the same case
|
|
||
| return nil, fmt.Errorf("unable to import local "+ | ||
| "addresses: %w", err) | ||
| } |
There was a problem hiding this comment.
This block imports the taproot output keys into lnd's wallet for outputs that belong to us. So the fix ensures addresses are always imported, regardless of who broadcasts.
It's a solid change. Think it would benefit from more comments.
| "disk: %w", err) | ||
| } | ||
|
|
||
| ctx, cancel = p.WithCtxQuitNoTimeout() |
There was a problem hiding this comment.
Gives us a new context for importLocalAddresses which is idempotent and can be retried. As opposed to the CtxBlocking context (line 2189) for LogPendingParcel which is a critical write to disk that must complete even during shutdown.
| } | ||
| } | ||
|
|
||
| func TestAdvanceStatePermanentFailureClearsRetryBookkeeping(t *testing.T) { |
There was a problem hiding this comment.
Nit: The name suggests "permanent failure after max attempts". It actually tests non-recoverable state failure
| t.Fatalf("did not expect pending parcel re-queue") | ||
| case <-time.After(100 * time.Millisecond): | ||
| } | ||
| } |
There was a problem hiding this comment.
Minor: Consider adding a test case for process shutdown similar to
`func TestSchedulePostDeliveryRetryShutdownDuringDelay(t *testing.T) {
porter := newTestChainPorter()
pkg := newTestSendPackage(SendStateTransferProofs)
recoverable := porter.schedulePostDeliveryRetry(
pkg, SendStateTransferProofs, errors.New("failure"),
)
require.True(t, recoverable)
// Close quit before timer fires (base delay is 1 second)
close(porter.Quit)
// Verify goroutine exits cleanly, no re-queue
select {
case <-porter.outboundParcels:
t.Fatalf("did not expect parcel after shutdown")
case <-time.After(1500 * time.Millisecond):
// Success
}
}`
|
@sergey3bv, remember to re-request review from reviewers when ready |
Should close #2120